Problem 1: Use the color picker app from the colorspace package (colorspace::choose_color()) to create a qualitative color scale containing five colors. One of the five colors should be #5C9E76, so you need to find four additional colors that go with this one.

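library(tidyverse)    # ggplot2 (including the midwest data), dplyr, tidyr
library(colorspace)   # choose_color(), swatchplot()

colors <- c("#5C9E76", "#E97451", "#3D6FFF", "#963DFF", "#FF3D6C")
swatchplot(colors)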
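One way to sanity-check a hand-picked qualitative palette is to look at the colors in a perceptual, polar color space: ideally they differ mainly in hue, without large jumps in lightness. The sketch below uses base R's grDevices::convertColor() to get approximate lightness, chroma, and hue for each color. The choice of CIE Luv and the name palette_lch are assumptions made for illustration and are not part of the assignment.

```r
colors <- c("#5C9E76", "#E97451", "#3D6FFF", "#963DFF", "#FF3D6C")

# sRGB values in [0, 1], one row per color
rgb01 <- t(grDevices::col2rgb(colors)) / 255

# Convert to CIE Luv: column 1 is lightness, columns 2 and 3 are the chromatic axes u and v
luv <- grDevices::convertColor(rgb01, from = "sRGB", to = "Luv")

# Polar form: chroma is the distance from the gray axis, hue is the angle in degrees
palette_lch <- data.frame(
  hex       = colors,
  lightness = luv[, 1],
  chroma    = sqrt(luv[, 2]^2 + luv[, 3]^2),
  hue       = (atan2(luv[, 3], luv[, 2]) * 180 / pi) %% 360
)
palette_lch
```

If one color turns out much lighter or darker than the rest, it may dominate the plot, and nudging it in the color picker is a reasonable next step.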

For the rest of this homework, we will be working with the midwest_clean dataset, which is a cleaned-up version of the ggplot2 midwest dataset.

midwest_clean <- midwest %>% 
  select(
    state, county, area, popdensity, percbelowpoverty, inmetro
  ) %>%        # keep only a subset of data
  na.omit()    # remove any rows with missing data
head(midwest_clean)
## # A tibble: 6 × 6
##   state county     area popdensity percbelowpoverty inmetro
##   <chr> <chr>     <dbl>      <dbl>            <dbl>   <int>
## 1 IL    ADAMS     0.052      1271.            13.2        0
## 2 IL    ALEXANDER 0.014       759             32.2        0
## 3 IL    BOND      0.022       681.            12.1        0
## 4 IL    BOONE     0.017      1812.             7.21       1
## 5 IL    BROWN     0.018       324.            13.5        0
## 6 IL    BUREAU    0.05        714.            10.4        0

Problem 2: Perform a PCA of the midwest_clean dataset and make a rotation plot of components 1 and 2.

The code below runs the PCA on the scaled numeric columns of midwest_clean and prints a summary of the result:

library(broom)   # provides tidy() and augment() for prcomp objects

midwest_pca <- midwest_clean %>%  # missing rows were already removed above
  select(where(is.numeric)) %>%   # retain only numeric columns
  select(-inmetro) %>%            # drop the 0/1 metro indicator
  scale() %>%                     # scale to zero mean and unit variance
  prcomp()                        # perform the principal component analysis

aug_midwest <- midwest_pca %>%
  augment(midwest_clean)          # attach the PC scores to the original rows

summary(midwest_pca)
## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.0597 0.9750 0.9625
## Proportion of Variance 0.3743 0.3169 0.3088
## Cumulative Proportion  0.3743 0.6912 1.0000
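The "Proportion of Variance" row follows directly from the standard deviations: each component's variance is its standard deviation squared, and its proportion is that variance divided by the total. As a quick check, the sketch below recomputes both rows from the rounded values printed above, so small rounding differences are to be expected. The variable names are only for illustration.

```r
# Standard deviations copied from the summary() output (already rounded)
sdev <- c(PC1 = 1.0597, PC2 = 0.9750, PC3 = 0.9625)

# Variance of each component and its share of the total variance
variance <- sdev^2
prop_var <- variance / sum(variance)
cum_prop <- cumsum(prop_var)

round(rbind(prop_var, cum_prop), 4)
```

Because the three input variables were scaled to unit variance, the total variance is 3 (up to rounding), one unit per variable.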
arrow_style <- arrow(
  angle = 20, length = grid::unit(6, "pt"),
  ends = "first", type = "closed"
) 

midwest_pca %>%
  # extract rotation matrix
  tidy(matrix = "rotation") %>%
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) %>%
  ggplot(aes(PC1, PC2, color = factor(column))) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style,
    linewidth = 1, alpha = 1   # linewidth replaces size for lines in ggplot2 >= 3.4.0
  ) +
  ggtitle("Rotation Matrix") +
  labs( #adding labels
    x = "Principal Component 1",
    y = "Principal Component 2",
    subtitle = "Figure 1"
    ) +
  xlim(-0.65, 0.65) + 
  ylim(-0.15, 1) +
  scale_color_manual(
    name = "Variables",
    values = colors,               # custom palette created in Problem 1
    guide = guide_legend(nrow = 1)
  ) +
  theme_bw() +                     # base theme
  theme(                           # drop the panel border, keep black axis lines
    legend.position = "top",
    axis.line = element_line(colour = "black"),
    panel.border = element_blank(),
    panel.background = element_blank()
  )

Problem 3: Make a scatter plot of PC 2 versus PC 1 and color by state. You should use the custom colorscale you created in Problem 1. Then use the rotation plot from Problem 2 to describe where Chicago, Illinois can be found on the scatter plot. Provide any additional evidence used to support your answer.

ggplot(
  aug_midwest,
  aes(.fittedPC1, .fittedPC2, color = state)
  ) +
  geom_point(
    size = 3,
    alpha = 0.75,
    shape = 16
    ) +
  ggtitle("Midwest Principal Components") +
  labs( #adding labels
    x = "Principal Component 1",
    y = "Principal Component 2",
    subtitle = "Figure 2",
    caption = "Chicago, Illinois is in Cook County, shown and labeled in the top right corner of the scatter plot." 
    ) +
  scale_color_manual(
    name = "State",
    values = colors                # custom palette created in Problem 1
  ) +
  guides(
    color = guide_legend(
      override.aes = list(shape = 16, size = 3, alpha = 0.75)
    )
  ) +
  # mark and label Cook County, which contains Chicago
  geom_point(
    data = filter(aug_midwest, county == "COOK"),
    shape = 13,
    size = 5
  ) +
  geom_text(
    data = filter(aug_midwest, county == "COOK"),
    label = "Cook County",         # constants are set outside aes()
    hjust = 1.13, vjust = 0.5,
    show.legend = FALSE,
    color = "black"
  ) +
  theme_bw() +                     # base theme
  theme(                           # drop the panel border, keep black axis lines
    legend.position = "top",
    axis.line = element_line(colour = "black"),
    panel.border = element_blank(),
    panel.background = element_blank()
  )

The rotation plot (Figure 1) suggests that principal component 1 measures how densely settled, or metropolitan, a county is: popdensity loads positively on PC 1, whereas area and percbelowpoverty load negatively. All three variables (popdensity, area, and percbelowpoverty) load positively on PC 2, with popdensity having the largest loading. PC 2 could therefore be read as a rough measure of county development.
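To make the link between loadings and scores concrete: an observation's score on a component is the sum of its scaled variables, each weighted by that variable's loading. The sketch below checks this on a small made-up data set. The toy values and object names are assumptions for illustration only and are not the midwest data.

```r
# Toy data standing in for three numeric county measurements (hypothetical values)
set.seed(1)
toy <- data.frame(
  area             = runif(20, 0.01, 0.06),
  popdensity       = rlnorm(20, meanlog = 7),
  percbelowpoverty = runif(20, 5, 30)
)

toy_scaled <- scale(toy)          # zero mean, unit variance per column
toy_pca    <- prcomp(toy_scaled)  # same PCA call as used for the midwest data

# Scores computed by hand: scaled data times the rotation (loading) matrix
manual_scores <- toy_scaled %*% toy_pca$rotation

# Compare with the scores that prcomp() stores in toy_pca$x
all.equal(unname(manual_scores), unname(toy_pca$x))
```

An observation with an unusually high scaled popdensity is therefore pushed along the direction of the popdensity arrow in the rotation plot.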

In the scatter plot (Figure 2), Cook County is the top-rightmost data point, with large positive scores on both principal component 1 and principal component 2. This is where Figure 1 places a county with a very high population density, so the position of Cook County, which contains Chicago, is consistent with the interpretation drawn from the rotation plot.